Question 1: Reading the Gapminder Data into R

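library(tidyverse)

# show_col_types is an argument of read_csv(), which already returns a tibble
GapminderData <- read_csv(file = Gapminder_Filelink, show_col_types = FALSE) %>%
  select(-`...1`) # drop the unnamed row-index column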

What we see here is the Gapminder dataset (labelled as cleaned, although it still needs some cleaning). It records various metrics, ranging from economic to agricultural, that describe individual countries over time.

Question 2: Filter Gapminder dataset by year of 1962 and make a scatter plot

filtered_year = 1962
GapminderFilteredYear <- GapminderData %>% 
  dplyr::filter(Year == filtered_year)

Question 2 (cont.): Make a scatter plot of the filtered dataset based on CO2 emissions and gdpPercap

ScatterPlot <- GapminderFilteredYear %>%
  ggplot(., aes(x = gdpPercap, 
                y = `CO2 emissions (metric tons per capita)`)) + 
  geom_point() +
  theme_classic()
## Warning: Removed 151 rows containing missing values (geom_point).

From the plot, we can see that there is a positive linear relationship between CO2 emissions and GDP per capita. Now let's investigate how strong that relationship is using the Pearson correlation coefficient (r).

Question 3: On the filtered data, calculate the Pearson correlation of ‘CO2 emissions (metric tons per capita)’ and gdpPercap. What is the Pearson R value and associated p value?

test = "pearson"
rm_na = "complete.obs"
pearson_corr = cor(GapminderFilteredYear$`CO2 emissions (metric tons per capita)`,
    GapminderFilteredYear$gdpPercap, 
    method = test,
    use = rm_na) * 100
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 151 rows containing non-finite values (stat_smooth).
## Warning: Removed 151 rows containing non-finite values (stat_cor).
## Warning: Removed 151 rows containing missing values (geom_point).

## The pearson correlation coefficient between CO2 emissions and GDP per capita is 92.6%

From what we can see here, the Pearson correlation coefficient is approximately 0.926 (92.61% once multiplied by 100, as in the code above), meaning that there is a strong positive correlation between CO2 emissions and GDP per capita across countries in 1962. In addition, the p-value (2.2 * 10^-6) is less than 0.05, meaning that the correlation between the two variables is statistically significant. Now let's look at all years and see which has the highest Pearson correlation coefficient.
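The `cor()` call above returns only the coefficient, so the p-value has to come from a separate test. A minimal sketch using base R's `cor.test()` follows. The data frame `toy`, its values, and the short column name `co2` are made up for illustration; in the analysis the two inputs would be the corresponding columns of `GapminderFilteredYear`.

```r
# Toy data standing in for GapminderFilteredYear (values are made up)
toy <- data.frame(
  gdpPercap = c(500, 1200, 3000, 4500, 8000, 12000, NA, 15000),
  co2       = c(0.1, 0.4, 1.1, NA, 3.2, 5.0, 2.0, 6.1)
)

# cor.test() drops incomplete pairs itself and returns both r and the p-value
result <- cor.test(toy$gdpPercap, toy$co2, method = "pearson")

pearson_r <- unname(result$estimate)  # Pearson r, between -1 and 1
p_value   <- result$p.value           # two-sided p-value for H0: r = 0

cat("r =", round(pearson_r, 3), " p =", signif(p_value, 3), "\n")
```

Reporting `pearson_r` on its native scale from -1 to 1, rather than as a percentage, keeps it comparable with the output of most other tools.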

Question 4: On the unfiltered data, answer “In what year is the correlation between ‘CO2 emissions (metric tons per capita)’ and gdpPercap the strongest?” Filter the dataset to that year for the next step…

test = "pearson"
rm_na = "complete.obs"

GapminderYear = GapminderData %>% #All unique years, as characters so they can name the result
  distinct(Year) %>% 
  pull() %>%
  as.character()

PearsonCorrYears = sapply(GapminderYear, function(year){
  
  GapminderOneYear = GapminderData %>% #Rows for the current year only
    filter(Year == as.numeric(year))
  
  cor(x = GapminderOneYear$`CO2 emissions (metric tons per capita)`,
      y = GapminderOneYear$gdpPercap, #Pearson correlation coefficient for that year
      method = test,
      use = rm_na)
  
}) %>% sort(decreasing = TRUE) #Strongest correlation first

PearsonCorrYears
##      1967      1962      1972      1982      1987      1992      1997      2002 
## 0.9387918 0.9260817 0.8428986 0.8166384 0.8095531 0.8094316 0.8081396 0.8006421 
##      1977      2007 
## 0.7928336 0.7204169

After iterating over the years in the Gapminder dataset, we can see that the highest Pearson correlation coefficient occurs in 1967 (r ≈ 0.939), so that is the year with the strongest correlation between CO2 emissions and GDP per capita. Now let's filter the Gapminder dataset to that year and draw a scatter plot with plotly.
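The same per-year calculation can also be written as a grouped summary instead of an explicit loop. The sketch below is one possible alternative, not the method used above; the tibble `toy`, its values, and the short column name `co2` are made up for illustration.

```r
library(dplyr)

# Toy data standing in for GapminderData (values are made up)
toy <- tibble(
  Year      = rep(c(1962, 1967), each = 4),
  gdpPercap = c(1, 2, 3, 4,  1, 2, 3, 4),
  co2       = c(1, 3, 2, 4,  1, 2, 3, 5)
)

# One Pearson coefficient per year, strongest first
corr_by_year <- toy %>%
  group_by(Year) %>%
  summarise(pearson_r = cor(co2, gdpPercap,
                            method = "pearson",
                            use = "complete.obs")) %>%
  arrange(desc(pearson_r))

strongest_year <- corr_by_year$Year[1]
print(corr_by_year)
```

Keeping the result as a data frame with a numeric `Year` column avoids the round trip through character names that the list-based version needs.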

Question 5: Using plotly, create an interactive scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent. You can easily convert any ggplot plot to a plotly plot using the ggplotly() command.

PearsonCorrMaxYear = PearsonCorrYears[which.max(PearsonCorrYears)] %>%
  names() %>% 
  as.numeric() #Finding the max year for the analysis

GapminderFilteredMax = GapminderData %>% ##Filter by year with the highest Pearson
  filter(Year == PearsonCorrMaxYear)     ##correlation coefficient of CO2 and GDP

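library(plotly)

GapminderMaxplot = GapminderFilteredMax %>% ## ggplot implementation
  ggplot(., aes(x =`CO2 emissions (metric tons per capita)`,
                y = gdpPercap,
                size = pop,
                color = continent)) + ## color by continent, as the question asks
  geom_point() + theme_classic()

ggplotly(GapminderMaxplot) ## convert the ggplot object to an interactive plotly plot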

Here is the plotly implementation of the Gapminder dataset for 1967, the year with the strongest correlation. The scatter plot is interactive: hovering over a point shows its values (gdpPercap, CO2 emissions).

New Question 1: What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’? (stats test needed)

GapminderContinentEnergyUse <- GapminderData %>% 
  select(continent,`Energy use (kg of oil equivalent per capita)`) %>% 
  na.omit()
ggboxplot(GapminderContinentEnergyUse, 
          x = "continent",
          y = "Energy use (kg of oil equivalent per capita)",
          color = "continent",
          add = "jitter",
          shape = "continent")

lm(formula = `Energy use (kg of oil equivalent per capita)` ~ continent,
   data = GapminderContinentEnergyUse) %>% summary()
## 
## Call:
## lm(formula = `Energy use (kg of oil equivalent per capita)` ~ 
##     continent, data = GapminderContinentEnergyUse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2796.0 -1107.5  -349.1   276.8 12904.4 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          698.5      137.2   5.090 4.42e-07 ***
## continentAmericas   1005.1      196.9   5.105 4.10e-07 ***
## continentAsia       1168.8      197.7   5.911 4.93e-09 ***
## continentEurope     2447.5      183.0  13.377  < 2e-16 ***
## continentOceania    3281.8      454.1   7.227 1.11e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1936 on 843 degrees of freedom
## Multiple R-squared:  0.1963, Adjusted R-squared:  0.1924 
## F-statistic: 51.46 on 4 and 843 DF,  p-value: < 2.2e-16
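The F-statistic at the bottom of this summary is the one-way ANOVA test of whether mean energy use is the same on every continent. With p < 2.2e-16 the means differ, although continent explains only about 20% of the variance (R-squared 0.196). Each coefficient is a difference from the baseline continent, which is the level absent from the coefficient table. To see which pairs of continents differ, one option is a Tukey HSD follow-up. The sketch below uses a made-up data frame `toy` with a hypothetical short column name `energy_use`.

```r
# Toy data standing in for GapminderContinentEnergyUse (values are made up)
toy <- data.frame(
  continent  = factor(rep(c("Africa", "Asia", "Europe"), each = 5)),
  energy_use = c(600, 700, 650, 720, 680,
                 1800, 1900, 1750, 2000, 1850,
                 3100, 3300, 3000, 3200, 3150)
)

# One-way ANOVA: the same F-test that summary(lm(...)) reports
fit <- aov(energy_use ~ continent, data = toy)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Tukey HSD: pairwise differences in group means with adjusted p-values
pairwise <- TukeyHSD(fit)$continent

print(anova_p)
print(pairwise)
```

The residuals in the summary above are strongly right-skewed (maximum 12904.4 against a minimum of -2796.0), so a rank-based check such as `kruskal.test()` may be a sensible complement.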

New Question 2: Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990? (stats test needed)
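One way to approach this question is to keep only the European and Asian rows after 1990 and compare the two groups with a Welch two-sample t-test. The sketch below is an assumption about how that could look, not a result from the real data: the tibble `toy` and its values are made up, and only the column name is taken from the question.

```r
library(dplyr)

# Toy data standing in for GapminderData (values are made up)
toy <- tibble(
  continent = rep(c("Europe", "Asia", "Africa"), each = 6),
  Year      = rep(c(1987, 1992, 1997, 2002, 2007, 2007), times = 3),
  `Imports of goods and services (% of GDP)` =
    c(30, 41, 44, 39, 46, 43,
      25, 28, 33, 30, 27, 31,
      20, 22, 21, 23, 24, 22)
)

# Keep Europe and Asia, years after 1990, and shorten the column name
imports_after_1990 <- toy %>%
  filter(continent %in% c("Europe", "Asia"), Year > 1990) %>%
  rename(imports = `Imports of goods and services (% of GDP)`)

# Welch two-sample t-test (does not assume equal variances)
welch <- t.test(imports ~ continent, data = imports_after_1990)
print(welch)
```

If the import shares turn out to be far from normally distributed, `wilcox.test()` accepts the same formula and data arguments.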

New Question 3: What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

GapminderPopDensity <- GapminderData %>% 
  select(`Country Name`, 
         `Population density (people per sq. km of land area)`,
         Year) 
GapminderPopDensity  %>% 
  ggplot(., 
         aes(x = Year, 
              y = `Population density (people per sq. km of land area)`,
              color = `Country Name`)) +
  geom_point()
## Warning: Removed 49 rows containing missing values (geom_point).
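With one color per country the plot is hard to read, and the question asks for an average ranking, which can be computed directly. The sketch below ranks countries within each year and then averages each country's rank. The tibble `toy`, its values, and the short column name `density` are made up for illustration; on the real data the column would be `Population density (people per sq. km of land area)`.

```r
library(dplyr)

# Toy data standing in for GapminderPopDensity (values are made up)
toy <- tibble(
  `Country Name` = rep(c("A", "B", "C"), times = 3),
  Year           = rep(c(1962, 1967, 1972), each = 3),
  density        = c(100, 300, 200,
                     120, 310, NA,
                     500, 320, 250)
)

avg_rank <- toy %>%
  filter(!is.na(density)) %>%                      # ignore missing measurements
  group_by(Year) %>%
  mutate(year_rank = min_rank(desc(density))) %>%  # rank 1 = densest that year
  group_by(`Country Name`) %>%
  summarise(mean_rank = mean(year_rank), n_years = n()) %>%
  arrange(mean_rank)                               # best average rank first

print(avg_rank)
```

The `n_years` column is kept because a country observed in only a few years can have a misleading average rank.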